
fix #173. When wandb detected pass to Trainer(...,devices=1,...) othe… #174

Merged
rvandewater merged 1 commit into rvandewater:development from rustamzh:fix-single-gpu-setup on Oct 22, 2025

Conversation

@rustamzh
Contributor

@rustamzh rustamzh commented Oct 22, 2025

When wandb is detected, pass devices=1 to Trainer(...); otherwise wandb creates multiple run folders and crashes. Fixes #173.

Summary by CodeRabbit

  • Chores
  • GPU device allocation for training is now managed based on runtime integration configuration: when Weights & Biases is active for experiment tracking, training is restricted to a single GPU; otherwise training uses all available GPUs for parallel execution.

@coderabbitai

coderabbitai Bot commented Oct 22, 2025

Walkthrough

Modified GPU device management in the training module to conditionally restrict to a single GPU when using wandb. Addresses multi-GPU sweep crashes by setting devices = 1 when wandb is enabled, falling back to multi-GPU usage otherwise.

Changes

Cohort / File(s): GPU device management (icu_benchmarks/models/train.py)
Summary: Added conditional logic to manage the GPU device count: initializes to max(cuda_device_count, 1), then restricts to devices = 1 when wandb is active to prevent multi-GPU sweep crashes. Updated the trainer initialization to use the managed devices variable.
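The device-selection logic summarized above can be sketched as follows. The helper name and its parameters are illustrative assumptions; the actual code in icu_benchmarks/models/train.py reads the CUDA device count and the wandb state directly rather than through such a function:

```python
# Sketch of the device-selection logic described in the summary above.
# The function and its parameters are hypothetical, introduced here only
# to illustrate the decision the PR makes.

def select_devices(cuda_device_count: int, wandb_active: bool) -> int:
    """Return the value to pass as Trainer(devices=...)."""
    # Use every visible CUDA device, but at least 1 so that CPU-only
    # machines still get a valid devices argument.
    devices = max(cuda_device_count, 1)
    if wandb_active:
        # wandb sweeps create one run folder per training process and
        # crash on multi-GPU machines, so restrict to a single device.
        devices = 1
    return devices
```

A caller would then do something along the lines of `Trainer(devices=select_devices(torch.cuda.device_count(), use_wandb), ...)`, which is the shape of the change this PR makes.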

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

Poem

🐰 When sweeping through the GPU garden, one runs faster than many—
A wandb waltz needs focus, so we set the device to one, not any.
Multi-GPU melted, now single GPU glows bright, 🎆
The trainer hops smoothly through the PyTorch night! ✨

Pre-merge checks and finishing touches

✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check ✅ Passed The pull request title "fix #173. When wandb detected pass to Trainer(...,devices=1,...) othe…" is directly related to the main change in the changeset. The title clearly references the specific fix being implemented: when wandb is detected, the Trainer is invoked with devices=1 to prevent multi-GPU issues. This aligns with the raw summary which shows the introduction of a devices variable that restricts to a single GPU when wandb is used. The title is concise, specific about the change, and references the linked issue.
Linked Issues Check ✅ Passed The code changes directly address the primary objective of issue #173, which is to prevent wandb from creating multiple folders and crashing on multi-GPU systems by restricting execution to a single GPU. The implementation introduces a devices variable that sets devices to 1 when wandb is detected, which corresponds precisely to the first proposed solution mentioned in the issue. The change modifies the Trainer initialization to use this variable, ensuring that when wandb is active, the training runs with a single GPU device as required.
Out of Scope Changes Check ✅ Passed The changes are focused and in-scope, consisting solely of modifications to icu_benchmarks/models/train.py related to the devices variable management logic for handling wandb usage. The raw summary indicates no alterations to exported or public entities, and all changes directly support the objective of fixing the wandb multi-GPU crash issue by restricting to a single GPU. No unrelated code modifications or out-of-scope changes are apparent in the changeset.
Docstring Coverage ✅ Passed No functions found in the changes. Docstring coverage check skipped.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
icu_benchmarks/models/train.py (1)

157-158: Pre-existing bug: Condition always evaluates to True.

This is not part of your changes, but there's a logic error here. The string "16-mixed" is always truthy, so this condition always executes regardless of the precision value.

Should be:

-    if precision == 16 or "16-mixed":
+    if precision == 16 or precision == "16-mixed":

Note: This bug may have minimal impact since setting matmul precision to "medium" is often desirable anyway, but the logic should still be corrected for clarity and correctness.
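The always-truthy condition can be reproduced with a standalone snippet (a minimal illustration, not the project's code):

```python
# Reproduces the bug flagged above: Python parses
# `precision == 16 or "16-mixed"` as `(precision == 16) or ("16-mixed")`,
# and a non-empty string literal is always truthy.

def buggy_check(precision) -> bool:
    return bool(precision == 16 or "16-mixed")

def fixed_check(precision) -> bool:
    # Compare precision against each accepted value explicitly.
    return precision == 16 or precision == "16-mixed"

print(buggy_check(32))          # True, even though 32 matches neither value
print(fixed_check(32))          # False
print(fixed_check("16-mixed"))  # True
```

An idiomatic alternative to the chained comparison is `precision in (16, "16-mixed")`.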

🧹 Nitpick comments (1)
icu_benchmarks/models/train.py (1)

145-148: Optional: Consider clarifying the wandb logging message.

The device restriction correctly prevents wandb from creating multiple folders on multi-GPU systems. The implementation is sound.

The log message is accurate but could be more specific:

-        logging.info("Use of wandb is detected. Only single gpu training is supported with wandb.")
+        logging.info("Wandb detected. Restricting training to a single device to prevent sweep crashes.")

This makes the workaround nature explicit and uses "device" instead of "gpu" for accuracy across hardware types.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3c25cb7 and 96b8cbd.

📒 Files selected for processing (1)
  • icu_benchmarks/models/train.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
icu_benchmarks/models/train.py (1)
icu_benchmarks/imputation/simple_diffusion.py (1)
  • on_fit_start (117-126)
🔇 Additional comments (2)
icu_benchmarks/models/train.py (2)

143-144: Device initialization logic looks good.

The initialization correctly defaults to at least 1 device and uses the available CUDA device count when GPUs are present. This provides a sensible baseline before the wandb-specific override.


166-166: Correct usage of the computed devices variable.

The dynamic device assignment properly integrates with the wandb-aware device management logic introduced earlier.

@rvandewater rvandewater merged commit 32760d4 into rvandewater:development Oct 22, 2025
2 checks passed


Development

Successfully merging this pull request may close these issues.

When running on multi-gpu systems the sweep crashes

2 participants